[WIP] feat: GPU passthrough for actors via gVisor nvproxy + cuda-checkpoint #96
Draft
Davanum Srinivas (dims) wants to merge 3 commits into
Wires the ActorTemplate CRD, atelet's OCI builder, and ateom-gvisor's
runsc invocations end-to-end so a containerised actor can declare GPU
intent and have it survive checkpoint/restore through gVisor's
official cuda-checkpoint path.
Changes:
- pkg/api/v1alpha1/actortemplate_types.go: Container grows an
optional Resources block carrying a GPUResource. Generated
zz_generated.deepcopy.go + ate.dev_actortemplates.yaml regenerated
via go generate.
- internal/proto/ateletpb + ateompb: Container grows a GpuSpec field
mirroring the CRD shape. cmd/atelet projects it from ActorTemplate
into the ateom workload spec.
- cmd/atelet/oci.go::prepareOCIDirectory: when a container requests
GPU, inject /dev/nvidia* device entries (host major/minor) and
bind-mount /usr/local/bin/cuda-checkpoint plus the wrapper script
read-only into the bundle. The pause container never gets GPU.
- cmd/ateom-gvisor/runsc.go: per-sandbox runsc struct learns to emit
--nvproxy, --nvproxy-driver-version,
--nvproxy-allowed-driver-capabilities on create/restore, and
--save-restore-exec-argv +
--save-restore-exec-timeout on checkpoint/restore. Without these,
runsc panics with `nvproxy.frontendFDMemmapFile is not saveable`
the moment an actor with live CUDA state is checkpointed.
- hack/cuda-checkpoint-wrapper.sh: the script runsc invokes inside
the sandbox via --save-restore-exec-argv. Enumerates CUDA-touching
PIDs by grepping /proc/*/maps and toggles each via cuda-checkpoint.
Skips only $$ (self), not PID 1 -- the workload is PID 1 inside
the sandbox, and skipping it makes the script find nothing.
Constraints from the gVisor source crawl:
* driver must be R570+ and in runsc nvproxy list-supported-drivers
* driver version must match across checkpoint and restore
* x86_64 only (cuda-checkpoint unsupported on arm64)
* /run/nvidia-persistenced/socket must be a regular file on the
host, because gVisor's gofer can't bind-mount Unix sockets and
nvidia-container-cli hard-codes this bind regardless of
persistenced state.
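The first three constraints lend themselves to a pre-flight check. The sketch below is a hypothetical helper, not code from the PR: `checkGPUConstraints` and `driverMajor` are invented names, and the `supported` slice stands in for the output of `runsc nvproxy list-supported-drivers`, which the caller would have to collect separately.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// driverMajor extracts the leading branch number from a version such as
// "580.126.09".
func driverMajor(version string) (int, error) {
	major, _, _ := strings.Cut(version, ".")
	n, err := strconv.Atoi(major)
	if err != nil {
		return 0, fmt.Errorf("malformed driver version %q: %w", version, err)
	}
	return n, nil
}

// checkGPUConstraints applies the constraints listed above: x86_64 only,
// driver branch 570 or newer, driver present in the supported list, and the
// same driver version at checkpoint and restore time.
func checkGPUConstraints(goarch, checkpointDriver, restoreDriver string, supported []string) error {
	if goarch != "amd64" {
		return errors.New("cuda-checkpoint is only supported on x86_64")
	}
	major, err := driverMajor(checkpointDriver)
	if err != nil {
		return err
	}
	if major < 570 {
		return fmt.Errorf("driver %s is older than R570", checkpointDriver)
	}
	if checkpointDriver != restoreDriver {
		return fmt.Errorf("driver mismatch: checkpoint %s, restore %s", checkpointDriver, restoreDriver)
	}
	for _, v := range supported {
		if v == checkpointDriver {
			return nil
		}
	}
	return fmt.Errorf("driver %s is not in the nvproxy supported list", checkpointDriver)
}

func main() {
	// Stand-in for the real supported-driver list.
	supported := []string{"580.126.09"}
	err := checkGPUConstraints("amd64", "580.126.09", "580.126.09", supported)
	fmt.Println("constraint check error:", err)
}
```

The persistenced socket constraint is left out because it depends on host filesystem state rather than on values the caller can pass in.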
Verified on an NVIDIA L40S with driver 580.126.09:
nvidia-smi works inside runsc-gpu sandboxes;
cuda-checkpoint --toggle drains a live CUDA context.
go vet ./..., GOOS=linux go build ./cmd/..., go test
./pkg/api/v1alpha1/... all clean.
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
Davanum Srinivas (dims) added a commit to dims/openshell-driver-substrate that referenced this pull request on May 27, 2026
The previous commit (0b46450) landed examples/gpu-counter/ but only the gpu-counter README itself. The repo-root README, docs/poc-intro.md, and the helpdesk demo's "Further reading" still referred to helpdesk as the only demo. This commit adds the cross-references so a reader landing anywhere in the docs can find gpu-counter.
Changes:
- README.md "Read first" gains a gpu-counter entry; "What's in the box" table gains a row; "Companion changes upstream" table gains agent-substrate/substrate#96 (the load-bearing substrate-side PR).
- docs/poc-intro.md "Demo entry point" becomes "Demo entry points" with both helpdesk and gpu-counter listed. "Companion changes" gains the substrate#96 entry.
- examples/helpdesk/README.md "Further reading" cross-refs the new sibling and substrate#96.
- examples/gpu-counter/README.md expanded ~10x to match helpdesk's depth: a 6-beat table organized as three acts; prereqs + companion-changes-upstream tables; explicit one-time host pre-flight block (persistenced socket replacement, cuda-checkpoint download, wrapper install, kind-node prep); Quick start; What's in this folder; Verified output (excerpt from the 2026-05-27 brev L40S run, including the substrate atelet RPC log excerpts that prove --nvproxy is on every runsc invocation); Troubleshooting matrix of six symptom→fix rows; Cleanup; Open follow-ups (the two items already in the linked impl-log note); Further reading.
Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>
…A buffer
The original feat/gpu-passthrough commit (c358dff) wired the CRD, proto and runsc flags but the demo only got as far as golden actor Run + Checkpoint; user actor Restore failed with `inconsistent private memory files on restore: savedMFOwners=[pause:/]` and the CUDA buffer in the workload was never observed to survive a substrate suspend/resume cycle.
This commit lands the five additional fixes the demo needed on the H100 brev box `front-emerald-krill` (driver 570.195.03, gVisor nightly 2026-05-26). With these, a 1 MiB CUDA buffer set via cuMemsetD8_v2 to byte 0x63 reads back at the same dev_ptr after a `kubectl ate suspend` + idle + `kubectl ate resume` cycle.
1. cmd/atelet/oci.go: add spec.Linux.Resources.Devices allow entries for every nvidia char-device. Without these the OCI bundle gives nvproxy the path but the host's cgroup eBPF device filter denies ioctl access in the sandbox boot path.
2. cmd/atelet/main.go: pass `firstGpuSpec(...)` to the pause container's prepareOCIDirectory too. Previously only the supervisor sub-container got --nvproxy via its OCI spec; runsc create pause launched the sandbox kernel with nvproxy disabled (`--dev-io-fd=-1` in the runsc debug log), so the dev gofer was never wired up and supervisor sub-container ioctls failed inside the sandbox with `nvproxy: failed to open device gofer nvidiactl: devutil.CtxDevGoferClient is not set`.
3. cmd/atelet/oci.go: bind-mount cuda-checkpoint and cuda-checkpoint-wrapper.sh from /run/ateom-gvisor/static-files (the shared HostPath volume) into /usr/local/bin inside the sandbox, falling back to /usr/local/bin on the atelet host. atelet runs inside the kind-control-plane container which doesn't have /usr/local/bin/cuda-checkpoint, so the previous os.Stat silently skipped both mounts.
4. cmd/ateom-gvisor/runsc.go + main.go: add cmdDrainCUDA and cmdUntoggleCUDA helpers that `runsc exec supervisor /usr/local/bin/cuda-checkpoint --toggle --pid 1` before CheckpointWorkload and after RestoreWorkload respectively. gVisor's --save-restore-exec-argv flag runs the binary inside the container being checkpointed (pause for substrate's root sandbox), but pause is the k8s pause image (distroless, no /bin/sh), so wrapper scripts with #!/bin/sh shebangs fail with `failed to load /usr/local/bin/cuda-checkpoint-wrapper.sh: no such file or directory`. Running cuda-checkpoint in the supervisor sub-container instead works because libcuda is there and the supervisor's PID 1 is the workload Python process.
5. cmd/ateom-gvisor/runsc.go: gpuSaveRestoreFlags returns nil and the comment explains why (vs. the previous comment which claimed nvproxy auto-registers; on the gVisor versions we use it does not, since there's no auto-registration code anywhere in the source, and explicit registration via the CLI flag conflicts with the external drain in agent-substrate#4).
Empirical demo trace (front-emerald-krill, 2026-05-27 15:42 UTC):
BEAT3 /set?val=99 → {"ok": true, "val": 99}
      /sum → {"sum": 405504, "sample": 99, ...}
      /info → {"dev_ptr": "0x7fe846600000", ...}
BEAT4 kubectl ate suspend actor gpu1 → STATUS_SUSPENDED
BEAT5 5 s idle
BEAT6 kubectl ate resume actor gpu1 → STATUS_RUNNING
      /info → {"dev_ptr": "0x7fe846600000", ...}
      ^^^ same address: CUDA context restored
      /sum → {"sum": 405504, "sample": 99, ...}
      ^^^ same data: buffer survived suspend
Two operational notes for the gpu-counter demo (live in the openshell driver repo):
- the workload image must bake the host's `libcuda.so.<host-driver>`; on kind there is no `nvidia-container-cli configure` hook to inject it from the host. The 580.x libcuda from the nvidia/cuda:12.6 base is rejected by nvproxy 570 with cuInit=NO_DEVICE.
- the runsc binary substrate uses must be the 2026-05-26 nightly or later; the release-20260520.0 tag has a multi-container nvproxy dev-gofer bug that returns cuInit=NO_DEVICE inside the supervisor sub-container even when pause has --nvproxy.
Companion notes:
- notes/openshell-on-substrate/2026-05-27-gpu-passthrough-impl-log.md
- notes/openshell-on-substrate/2026-05-25-gpu-passthrough-analysis.md
New helper injectNVIDIAAssetsIntoRootfs (cmd/atelet/oci.go) mirrors
the host's NVIDIA driver libs from
/run/ateom-gvisor/static-files/nvidia-libs/ into each new actor's
<rootfs>/usr/lib/x86_64-linux-gnu at sandbox-create time. Real .so
files are copied byte-for-byte; symlinks are recreated as symlinks.
Operators stage those libs once per box (Appendix I of
2026-05-27-gpu-passthrough-runbook.md drops a copy of bigbox's
nvidia-container-cli list --libraries output + the transitive
SONAME / dev symlinks into the kind-node).
Effect: workload images no longer have to COPY libcuda.so.<host-driver>
in their Dockerfile to satisfy dlopen("libcuda.so.1") inside the
gVisor sandbox. This is the substrate-side equivalent of what
nvidia-container-cli configure --compute --utility --device=all does
in the standard docker+nvidia-container-runtime flow.
Why a Go mirror rather than exec'ing nvidia-container-cli configure:
atelet ships on distroless/static-debian13, so it has no dynamic
linker for nvidia-container-cli's libnvidia-container.so.1 dep. The
end state (driver libs at the linker's default search path) is
identical.
Hard-fails if the staging dir is missing or empty so an operator
misconfiguration surfaces immediately instead of crashing inside the
sandbox.
End-to-end verified on bigbox-h200 (NVIDIA H200 NVL, driver
580.159.03) with an unmodified ubuntu:24.04 + python3 workload image
(no libcuda baked in) — full 6-beat suspend/resume preserves
dev_ptr=0x7f9f23e00000 and GPU buffer byte 0xa7.
Collaborator
Benjamin Elder (BenTheElder)
left a comment
If this is only nvidia GPUs, we should probably name it appropriately instead of "GPU"
What about other CDI Devices? TPUs? At the very least we probably want to leave API shape for this.
@@ -0,0 +1,20 @@
#!/bin/sh
Collaborator
Currently I don't think we're shipping any other hack/ script to prod. We should probably move this (or replace it with a binary)
Summary
End-to-end GPU passthrough for substrate actors via gVisor's nvproxy + cuda-checkpoint path.
Changes
- CRD: `pkg/api/v1alpha1/actortemplate_types.go::Container` grows an optional `Resources *ContainerResources` carrying `GPU *GPUResource{Count, Device, DriverCapabilities, DriverVersion}`. `zz_generated.deepcopy.go` and `manifests/.../ate.dev_actortemplates.yaml` regenerated via `go generate ./...`.
- Protos: `internal/proto/ateletpb` and `internal/proto/ateompb` `Container` messages gain a `GpuSpec gpu` field mirroring the CRD shape.
- ateapi → atelet: `cmd/ateapi/internal/controlapi/gpu.go` adds `toAteletGpuSpec(*v1alpha1.ContainerResources) *ateletpb.GpuSpec`; the resume and suspend workflows populate it on each `ateletpb.Container`.
- atelet → ateom: `cmd/atelet/main.go` projects `ateletpb.GpuSpec → ateompb.GpuSpec` via `toAteomGpuSpec`. `cmd/atelet/oci.go::prepareOCIDirectory` gains a `gpu *ateletpb.GpuSpec` parameter and `addGPUToOCISpec()` helper that injects `/dev/nvidia*` device nodes (host major/minor) into `Linux.Devices` and bind-mounts `/usr/local/bin/cuda-checkpoint` + the wrapper script when the workload requests GPU. Pause containers pass `nil`.
- ateom-gvisor: `cmd/ateom-gvisor/runsc.go` gains a `gpu *ateompb.GpuSpec` field on the `runsc` struct (populated via `firstGPUSpec` from the workload's containers). `gpuGlobalFlags()` emits `--nvproxy [--nvproxy-driver-version=X] [--nvproxy-allowed-driver-capabilities=...]` on `runsc create/checkpoint/restore`. `gpuSaveRestoreFlags()` is gated to the root container only (the supervisor sub-container restore must not re-invoke the wrapper, and gVisor's nvproxy auto-registers cuda-checkpoint internally on `release-20260520.0`).
- Wrapper: `hack/cuda-checkpoint-wrapper.sh` (20 lines). Idempotent `cuda-checkpoint --toggle` over every CUDA-touching PID found in `/proc/*/maps`. Only skips `$$` (self), not PID 1, because inside a substrate sandbox the workload is PID 1. Used by the bare-metal `validate-bare.sh` in the demo dir; substrate proper leans on gVisor's internal cuda-checkpoint registration.
- Tests: `cmd/ateom-gvisor/runsc_test.go` (linux build tag) covers `gpuGlobalFlags`, `gpuSaveRestoreFlags`, `firstGPUSpec`. The duration-string regression (`30s`, not `30000`) is asserted explicitly. `go test ./pkg/api/v1alpha1/...` clean.
Constraints (from the gVisor source crawl)
- `runsc nvproxy list-supported-drivers`. release-20260520.0 supports 16 versions across 535/550/570/580/590.
- `nvidia-persistenced` running, replace `/run/nvidia-persistenced/socket` with a regular file: gVisor's gofer can't bind-mount Unix sockets, and `nvidia-container-cli` hard-codes that mount.
Test plan
- `go vet ./pkg/... ./internal/proto/... ./cmd/atelet/... ./cmd/ateapi/...`
- `GOOS=linux go build ./cmd/atelet ./cmd/ateapi ./cmd/ateom-gvisor`
- `go test ./pkg/api/v1alpha1/...`